Mohammad Soleymani

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

May 26, 2026

Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani

Abstract:Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, raising Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.

* Accepted as a conference paper at ICML 2026. Project page: https://voxparadox.github.io/
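
The abstract pairs PCLM with Direct Preference Optimization so that the acoustically supported option is preferred over the language-implied one. As a rough illustration only, the standard DPO objective for a single preference pair can be sketched as below. The function name, the argument names, and the `beta` default are assumptions made for this sketch and are not taken from the paper, whose exact training setup is not described here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    "chosen" stands for the acoustically supported answer and "rejected"
    for the language-implied one (hypothetical naming for this example).
    """
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen answer relative to the reference.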

The manifold of unitary and symmetric matrices: characterization, Riemannian optimization and application to BD-RIS design

Apr 24, 2026

Ignacio Santamaria, Carlos Beltrán, Eduard Jorswieck, Mohammad Soleymani, Jesus Gutiérrez

Abstract:This paper proposes and analyzes Riemannian optimization algorithms on the manifold of unitary and symmetric matrices, denoted ${\cal {U}}_s$, which naturally models the scattering matrices of passive and reciprocal devices such as beyond-diagonal reconfigurable intelligent surfaces (BD-RISs). Despite its relevance, the geometry of ${\cal {U}}_s$ has remained largely unexplored, and existing BD-RIS optimization methods either ignore the symmetry constraint or rely on costly Takagi-based parameterizations. We first provide a rigorous geometric characterization of ${\cal {U}}_s$, deriving its tangent space, a simple retraction, and closed-form expressions for geodesics. Building on these results, we develop two Riemannian manifold optimization (MO) algorithms tailored to ${\cal {U}}_s$: a line-search (LS) based scheme and a phase-optimization (PO) update along geodesics. We then apply the proposed framework to BD-RIS-assisted multiple-input multiple-output (MIMO) links, addressing sum-gain maximization, rate maximization, and minimum mean-square error problems, where they outperform existing approaches. Furthermore, we show that when the number of BD-RIS elements exceeds the total number of antennas, the optimal scattering matrix is low-rank, which motivates and enables efficient low-rank variants of the proposed algorithms.

* 12 pages, 5 figures. arXiv admin note: text overlap with arXiv:2601.13877
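
To make the constraint set concrete, the short derivation below checks that one simple family of matrices lies in ${\cal {U}}_s$. It relies only on standard linear algebra; the symbols $Q$, $\Theta$, and $A$ are introduced for this illustration and are not claimed to be the parameterization used in the paper.

```latex
% Let Q be real orthogonal (Q^T Q = I) and \Theta be real diagonal.
\begin{align}
  S &= Q\, e^{j\Theta}\, Q^{T}
     = e^{jA}, \qquad A = Q\,\Theta\,Q^{T} = A^{T} \in \mathbb{R}^{M \times M}, \\
  S^{T} &= Q\, \bigl(e^{j\Theta}\bigr)^{T} Q^{T} = Q\, e^{j\Theta}\, Q^{T} = S
     && \text{(symmetric)}, \\
  S^{H} S &= Q\, e^{-j\Theta}\, Q^{T} Q\, e^{j\Theta}\, Q^{T} = Q\, Q^{T} = I
     && \text{(unitary)}.
\end{align}
```

So the exponential $e^{jA}$ of any real symmetric matrix $A$ is both unitary and symmetric, which fits the abstract's description of scattering matrices for passive, reciprocal devices.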

Optimal symmetric low-rank BD-RIS configuration maximizing the determinant of a MIMO link

Apr 10, 2026

Ignacio Santamaria, Mohammad Soleymani, Jesus Gutierrez, Eduard Jorswieck

Abstract:Beyond-diagonal reconfigurable intelligent surfaces (BD-RISs) significantly improve wireless performance by allowing tunable interconnections among elements, but their design in multiple-input multiple-output (MIMO) systems has so far relied on complex iterative algorithms or suboptimal approximations. This work introduces a simple yet powerful approach: instead of directly maximizing the achievable rate, we maximize the absolute value of the determinant of the equivalent MIMO channel. We derive a closed-form symmetric unitary scattering matrix whose rank is exactly twice the channel's degrees of freedom ($2r$). Remarkably, this low-rank solution achieves the same determinant value as the optimal unitary BD-RIS. Using log-majorization theory, we prove that the rate loss relative to the optimal unitary BD-RIS vanishes at high signal-to-noise ratio (SNR) or when the number of BD-RIS elements becomes large. Moreover, the proposed solution can be perfectly implemented using a $q$-stem BD-RIS architecture with only $q=2r-1$ stems, requiring a minimum number of reconfigurable circuits. The resulting Max-Det solution is orders of magnitude faster to compute than existing iterative methods while achieving near-optimal rates in practical scenarios. This makes high-performance BD-RIS deployment feasible even with large surfaces and limited computational resources.

* 12 pages, 5 figures

GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Mar 26, 2026

Zhangyu Jin, Maksim Siniukov, Deuksin Kwon, Ashutosh Chaubey, Mohammad Soleymani

Abstract:Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the 'Regression-to-the-Mean' problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.
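
The abstract describes isolating reward normalization across distinct FLAME parameter groups. One plausible reading, sketched below purely as an assumption, is that each group's rewards are standardized across the sampled rollouts on their own before being combined, so that a group with a large reward scale cannot drown out the others. The function name, the group names, and the choice to sum the normalized terms are hypothetical and are not confirmed by the source.

```python
from statistics import mean, pstdev

def decoupled_advantages(rewards_by_group, eps=1e-8):
    """Standardize each reward group across rollouts separately, then sum.

    rewards_by_group maps a (hypothetical) parameter-group name to a list
    holding one reward per sampled rollout. All lists must share one length.
    """
    names = list(rewards_by_group)
    if not names:
        raise ValueError("at least one reward group is required")
    n = len(rewards_by_group[names[0]])
    total = [0.0] * n
    for name in names:
        rewards = rewards_by_group[name]
        if len(rewards) != n:
            raise ValueError("every group needs one reward per rollout")
        mu, sigma = mean(rewards), pstdev(rewards)
        for i, r in enumerate(rewards):
            # Per-group z-score: the group's raw scale no longer matters.
            total[i] += (r - mu) / (sigma + eps)
    return total

# Two rollouts; the "jaw" rewards are 100x larger in scale than "expression".
advantages = decoupled_advantages({
    "expression": [0.0, 1.0],
    "jaw": [100.0, 0.0],
})
```

In this toy example the two groups disagree about which rollout is better, and after per-group normalization they cancel almost exactly. Normalizing the summed raw rewards instead would let the large-scale "jaw" group decide the outcome.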

RIS-Aided RSMA Improves the Latency vs. Energy Trade-off in the Finite Block Length MIMO Downlink

Mar 16, 2026

Mohammad Soleymani, Bruno Clerckx, Robert Schober, Lajos Hanzo

Abstract:We simultaneously minimize the latency and improve energy efficiency (EE) of the multi-user multiple-input multiple-output (MU-MIMO) rate splitting multiple access (RSMA) downlink, aided by a reconfigurable intelligent surface (RIS). Our results show that RSMA improves the EE and may reduce the delay to 13% of that of spatial division multiple access (SDMA). Moreover, RIS and RSMA support each other synergistically, while an RIS operating without RSMA provides limited benefits in terms of latency and cannot effectively mitigate interference. Furthermore, increasing the RIS size amplifies the gains of RSMA more significantly than those of SDMA, without altering the fundamental EE-latency trade-offs. Results also show that latency increases with more stringent reliability requirements, and RSMA yields more significant gains under such conditions, making it eminently suitable for energy-efficient ultra-reliable low-latency communication (URLLC) scenarios.

* Accepted at IEEE Open Journal of Vehicular Technology

MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Mar 03, 2026

Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

Abstract:Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.

* CVPR 2026. Project Page: https://mod-dpo.github.io/

HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

Feb 11, 2026

Di Chang, Ji Hou, Aljaz Bozic, Assaf Neuberger, Felix Juefei-Xu, Olivier Maury, Gene Wei-Chin Lin, Tuur Stuyck, Doug Roble, Mohammad Soleymani (+1 more)

Abstract:We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject's photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.

* Website: https://boese0601.github.io/hairweaver/

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

Feb 04, 2026

Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

Abstract:Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models, with relative performance gains of 6-19% in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models and benchmark will be released at https://avere-iclr.github.io.

* Accepted as a conference paper at ICLR 2026. Project page: https://avere-iclr.github.io

Riemannian optimization on the manifold of unitary and symmetric matrices with application to BD-RIS-assisted systems

Jan 20, 2026

Ignacio Santamaria, Mohammad Soleymani, Eduard Jorswieck, Jesus Gutierrez, Carlos Beltran

Abstract:In this paper, we rigorously characterize for the first time the manifold of unitary and symmetric matrices, deriving its tangent space and its geodesics. The resulting parameterization of the geodesics (through a real and symmetric matrix) allows us to derive a new Riemannian manifold optimization (MO) algorithm whose most remarkable feature is that it does not need to set any adaptation parameter. We apply the proposed MO algorithm to maximize the achievable rate in a multiple-input multiple-output (MIMO) system assisted by a beyond-diagonal reconfigurable intelligent surface (BD-RIS), illustrating the method's performance through simulations. The MO algorithm achieves a significant reduction in computational cost compared to previous alternatives based on Takagi decomposition, while retaining global convergence to a stationary point of the cost function.

* 5 pages, 2 figures

Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Oct 02, 2025

Minh Tran, Maksim Siniukov, Zhangyu Jin, Mohammad Soleymani

Abstract:Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
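
The tokenization step described above, where each token from a shared codebook encodes what earlier tokens left unexplained, is the general idea of residual vector quantization. The sketch below shows only that quantization step and a bag-of-words histogram over the resulting tokens. It is a minimal illustration under stated assumptions: the toy codebook, the function names, and the two-stage depth are hypothetical, and the learned encoder, decoder, and training losses of an RVQ-VAE are left out entirely.

```python
from collections import Counter

def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )

def rvq_encode(vec, codebook, num_stages):
    """Quantize vec in stages; each stage encodes the residual of the previous one."""
    residual = list(vec)
    tokens = []
    for _ in range(num_stages):
        k = nearest(codebook, residual)
        tokens.append(k)
        # Subtract the chosen codeword so the next stage refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens, residual

def bag_of_words(token_seqs, vocab_size):
    """Count how often each codebook index appears across token sequences."""
    counts = Counter(t for seq in token_seqs for t in seq)
    return [counts.get(k, 0) for k in range(vocab_size)]

# Toy shared codebook with four 2-D codewords (values are made up).
codebook = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.25], [0.0, 0.0]]
tokens, residual = rvq_encode([1.2, 0.3], codebook, num_stages=2)
histogram = bag_of_words([tokens, [2, 3]], vocab_size=len(codebook))
```

In this toy run the first stage selects the coarse codeword at index 0 and the second stage selects the smaller correction at index 2. A histogram of this kind is the sort of fixed-length feature a simple downstream classifier could consume.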
